Airbnb Analytics
Initiation
We will be analyzing AirBnB listing data to create a model to predict the total cost for two people staying 4 nights in AirBnB in Mexico City.
# download the data
listings <- vroom::vroom("http://data.insideairbnb.com/mexico/df/mexico-city/2020-06-20/data/listings.csv.gz")
Exploratory Data Analysis
First we will conduct an Exploratory Data Analysis to better understand our data, understanding the number of columns, rows, type of observations. We will also create scatterplots to better understand correlation between different variables.
Looking at the raw data
# have an initial look at the data
glimpse(listings)
## Rows: 21,824
## Columns: 106
## $ id <dbl> 35797, 56074, 61792, 706…
## $ listing_url <chr> "https://www.airbnb.com/…
## $ scrape_id <dbl> 2.02e+13, 2.02e+13, 2.02…
## $ last_scraped <date> 2020-06-23, 2020-06-26,…
## $ name <chr> "Villa Dante", "Great sp…
## $ summary <chr> "Dentro de Villa un estu…
## $ space <chr> "please go to (URL HIDDE…
## $ description <chr> "Dentro de Villa un estu…
## $ experiences_offered <chr> "none", "none", "none", …
## $ neighborhood_overview <chr> "Centro comercial Santa …
## $ notes <chr> "Si te gustan la tipo ha…
## $ transit <chr> "Uber es buena opción o …
## $ access <chr> "Jardin muy Amplio.", NA…
## $ interaction <chr> "Cualquier duda contácte…
## $ house_rules <chr> "Se renta un estudio de…
## $ thumbnail_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ medium_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ picture_url <chr> "https://a0.muscache.com…
## $ xl_picture_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_id <dbl> 153786, 265650, 299558, …
## $ host_url <chr> "https://www.airbnb.com/…
## $ host_name <chr> "Dici", "Maris", "Robert…
## $ host_since <date> 2010-06-28, 2010-10-19,…
## $ host_location <chr> "Mexico City, Mexico Cit…
## $ host_about <chr> "Master in visual arts, …
## $ host_response_time <chr> "N/A", "within an hour",…
## $ host_response_rate <chr> "N/A", "100%", "100%", "…
## $ host_acceptance_rate <chr> "N/A", "91%", "67%", "10…
## $ host_is_superhost <lgl> FALSE, TRUE, FALSE, TRUE…
## $ host_thumbnail_url <chr> "https://a0.muscache.com…
## $ host_picture_url <chr> "https://a0.muscache.com…
## $ host_neighbourhood <chr> NA, "San Rafael", "Conde…
## $ host_listings_count <dbl> 2, 2, 1, 4, 3, 3, 4, 2, …
## $ host_total_listings_count <dbl> 2, 2, 1, 4, 3, 3, 4, 2, …
## $ host_verifications <chr> "['email', 'phone', 'rev…
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ host_identity_verified <lgl> FALSE, FALSE, FALSE, TRU…
## $ street <chr> "Mexico City, D.f., Mexi…
## $ neighbourhood <chr> NA, "San Rafael", "Conde…
## $ neighbourhood_cleansed <chr> "Cuajimalpa de Morelos",…
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, …
## $ city <chr> "Mexico City", "Mexico C…
## $ state <chr> "D.f.", "DF", "Ciudad de…
## $ zipcode <chr> NA, NA, "06140", "04100"…
## $ market <chr> "Mexico City", "Mexico C…
## $ smart_location <chr> "Mexico City, Mexico", "…
## $ country_code <chr> "MX", "MX", "MX", "MX", …
## $ country <chr> "Mexico", "Mexico", "Mex…
## $ latitude <dbl> 19.4, 19.4, 19.4, 19.4, …
## $ longitude <dbl> -99.3, -99.2, -99.2, -99…
## $ is_location_exact <lgl> FALSE, TRUE, TRUE, TRUE,…
## $ property_type <chr> "Villa", "Condominium", …
## $ room_type <chr> "Entire home/apt", "Enti…
## $ accommodates <dbl> 2, 3, 2, 2, 2, 2, 14, 2,…
## $ bathrooms <dbl> 1.0, 1.0, 1.0, 1.0, 1.5,…
## $ bedrooms <dbl> 1, 1, 1, 1, 1, 1, 4, 1, …
## $ beds <dbl> 1, 2, 1, 1, 1, 1, 10, 0,…
## $ bed_type <chr> "Futon", "Real Bed", "Re…
## $ amenities <chr> "{Wifi,Kitchen,\"Free pa…
## $ square_feet <dbl> 32292, 646, 161, NA, 155…
## $ price <chr> "$4,500.00", "$843.00", …
## $ weekly_price <chr> NA, "$4,740.00", NA, "$9…
## $ monthly_price <chr> "$124,995.00", "$15,724.…
## $ security_deposit <chr> NA, "$2,279.00", "$11,32…
## $ cleaning_fee <chr> NA, "$684.00", "$340.00"…
## $ guests_included <dbl> 1, 2, 2, 2, 1, 1, 6, 1, …
## $ extra_people <chr> "$0.00", "$342.00", "$11…
## $ minimum_nights <dbl> 1, 4, 2, 6, 4, 1, 2, 4, …
## $ maximum_nights <dbl> 7, 150, 21, 180, 365, 73…
## $ minimum_minimum_nights <dbl> 1, 4, 2, 6, 4, 1, 2, 4, …
## $ maximum_minimum_nights <dbl> 1, 4, 2, 6, 4, 1, 2, 4, …
## $ minimum_maximum_nights <dbl> 7, 1125, 21, 180, 365, 7…
## $ maximum_maximum_nights <dbl> 7, 1125, 21, 180, 365, 7…
## $ minimum_nights_avg_ntm <dbl> 1.0, 4.0, 2.0, 6.0, 4.0,…
## $ maximum_nights_avg_ntm <dbl> 7, 1125, 21, 180, 365, 7…
## $ calendar_updated <chr> "35 months ago", "4 week…
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ availability_30 <dbl> 23, 0, 30, 28, 0, 0, 25,…
## $ availability_60 <dbl> 53, 0, 60, 58, 19, 0, 45…
## $ availability_90 <dbl> 83, 0, 90, 88, 49, 18, 7…
## $ availability_365 <dbl> 358, 0, 180, 363, 319, 1…
## $ calendar_last_scraped <date> 2020-06-23, 2020-06-26,…
## $ number_of_reviews <dbl> 0, 60, 52, 102, 10, 0, 2…
## $ number_of_reviews_ltm <dbl> 0, 2, 1, 10, 2, 0, 11, 1…
## $ first_review <date> NA, 2017-11-18, 2017-11…
## $ last_review <date> NA, 2019-07-24, 2019-11…
## $ review_scores_rating <dbl> NA, 97, 98, 98, 100, NA,…
## $ review_scores_accuracy <dbl> NA, 10, 10, 10, 10, NA, …
## $ review_scores_cleanliness <dbl> NA, 10, 10, 10, 10, NA, …
## $ review_scores_checkin <dbl> NA, 10, 10, 10, 10, NA, …
## $ review_scores_communication <dbl> NA, 10, 10, 10, 10, NA, …
## $ review_scores_location <dbl> NA, 10, 10, 10, 10, NA, …
## $ review_scores_value <dbl> NA, 10, 10, 10, 10, NA, …
## $ requires_license <lgl> FALSE, FALSE, FALSE, FAL…
## $ license <lgl> NA, NA, NA, NA, NA, NA, …
## $ jurisdiction_names <chr> "{\"Mexico City\",\" MX …
## $ instant_bookable <lgl> FALSE, TRUE, FALSE, FALS…
## $ is_business_travel_ready <lgl> FALSE, FALSE, FALSE, FAL…
## $ cancellation_policy <chr> "flexible", "moderate", …
## $ require_guest_profile_picture <lgl> FALSE, FALSE, FALSE, FAL…
## $ require_guest_phone_verification <lgl> FALSE, FALSE, FALSE, FAL…
## $ calculated_host_listings_count <dbl> 1, 2, 2, 3, 2, 3, 4, 2, …
## $ calculated_host_listings_count_entire_homes <dbl> 1, 2, 0, 2, 2, 1, 2, 0, …
## $ calculated_host_listings_count_private_rooms <dbl> 0, 0, 2, 1, 0, 2, 2, 2, …
## $ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reviews_per_month <dbl> NA, 1.89, 1.62, 1.00, 0.…
Before starting with the actual data cleaning, let’s visualize which variables have a lot of missing values.
# identify missing values
plot_missing(listings,group = list(Good = 0.05, OK = 0.4, Bad = 0.8, Remove = 1), geom_label_args = list(),
title = NULL,
ggtheme = theme_gray(), theme_config = list(legend.position = c("bottom")))

We will now look at the type of data to understand the kind of variables and the data type. Ensuring that quantitaive variables are stored as numeric data and dealing with missing values.
Transform data type
# transform variables into correct data types
listings_wip <- listings %>%
# remove dollar signs and transform them to numeric data
mutate(price = parse_number(price)) %>%
mutate(cleaning_fee = parse_number(cleaning_fee)) %>%
mutate(extra_people = parse_number(extra_people)) %>%
# turn character into factor
mutate(room_type = fct_relevel(room_type,
"Shared room",
"Private room",
"Entire home/apt")) %>%
# turn character into factor
mutate(cancellation_policy = fct_relevel(cancellation_policy,
"flexible",
"moderate",
"strict_14_with_grace_period",
"super_strict_30",
"super_strict_60"))
Deal with missing values: cleaning_fee
## number of NAs
sum(is.na(listings_wip$cleaning_fee))
## setting NAs to 0, because NA means no cleaning was booked
listings_wip$cleaning_fee[is.na(listings_wip$cleaning_fee)] <- 0
Simplify variables:
# Property type
## what are the most frequent types
listings_wip %>%
count(property_type, sort = TRUE)
## # A tibble: 37 x 2
## property_type n
## <chr> <int>
## 1 Apartment 13545
## 2 House 3289
## 3 Condominium 1577
## 4 Loft 1120
## 5 Guest suite 571
## 6 Serviced apartment 369
## 7 Guesthouse 202
## 8 Boutique hotel 195
## 9 Hostel 177
## 10 Bed and breakfast 174
## # … with 27 more rows
# visualize it with ggplot
ggplot(listings_wip)+
geom_bar(aes(x=fct_infreq(property_type)))+
coord_flip() +
labs(title = "Apartments clearly being the most common property type",
subtitle = "Most frequent property types",
x = "Property type",
y = "Count") +
theme_bw()

# simplify variable
listings_wip <- listings_wip %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Apartment","House", "Condominium","Loft") ~ property_type,
TRUE ~ "Other"
))
# check if simplification worked
listings_wip %>%
count(property_type, prop_type_simplified) %>%
arrange(desc(n))
## # A tibble: 37 x 3
## property_type prop_type_simplified n
## <chr> <chr> <int>
## 1 Apartment Apartment 13545
## 2 House House 3289
## 3 Condominium Condominium 1577
## 4 Loft Loft 1120
## 5 Guest suite Other 571
## 6 Serviced apartment Other 369
## 7 Guesthouse Other 202
## 8 Boutique hotel Other 195
## 9 Hostel Other 177
## 10 Bed and breakfast Other 174
## # … with 27 more rows
# Minimum nights
## what are the most common values
listings_wip2 <- listings_wip %>%
count(minimum_nights, sort = TRUE)
## only include listings with minimum nights of 4
listings_wip3 <- listings_wip2 %>%
filter(minimum_nights <= 4)
# Display result
listings_wip3 %>%
kbl(col.names =c("Minimum Nights","Count") ) %>%
kable_material(c("striped", "hover")) %>%
kable_styling(fixed_thead = T)
| Minimum Nights | Count |
|---|---|
| 1 | 10088 |
| 2 | 6474 |
| 3 | 2376 |
| 4 | 490 |
Which value stands out & what is probably the purpose of listings with such minimum requirements?
ANSWER: The minimum nights thresholds of 15, 30, 180 and 365 days stand out since they represent a large, but clearly defined period within a year (half a month, one month, half a year, one year). While many listings on Airbnb are peoples’ private apartments which they rent out occasionally for a few days, properties with the minimum requirements mentioned above appear to have been built only for the purpose of renting. An example might be an apartment close to a big city which is targeting young professionals that work there for a limited time period. Therefore, those listings are not targeting the typical weekend trip tourists that are looking for a short stay.
Summary statistics of variables of interests
To simplify a few steps later on, we are defining lists of variables of interest.
# select variables of interest
var_of_interest_quant = c("id", "price", "cleaning_fee", "extra_people", "number_of_reviews", "review_scores_rating")
var_of_interest_qual = c("property_type", "room_type", "neighbourhood")
var_of_interest = c(var_of_interest_quant, var_of_interest_quant)
To begin, let’s pick out a few particularly interesting variables and briefly look at their statistics.
# compute summary statistics and look for missing values (NA)
skim(listings_wip)
| Name | listings_wip |
| Number of rows | 21824 |
| Number of columns | 107 |
| _______________________ | |
| Column type frequency: | |
| character | 43 |
| Date | 5 |
| factor | 2 |
| logical | 15 |
| numeric | 42 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 34 | 37 | 0 | 21824 | 0 |
| name | 8 | 1.00 | 1 | 255 | 0 | 21094 | 0 |
| summary | 1443 | 0.93 | 1 | 1000 | 0 | 18458 | 0 |
| space | 6009 | 0.72 | 1 | 1000 | 0 | 14005 | 0 |
| description | 1141 | 0.95 | 1 | 1000 | 0 | 19519 | 0 |
| experiences_offered | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| neighborhood_overview | 6312 | 0.71 | 1 | 1000 | 0 | 12861 | 0 |
| notes | 13372 | 0.39 | 1 | 1000 | 0 | 7097 | 0 |
| transit | 7254 | 0.67 | 1 | 1000 | 0 | 12094 | 0 |
| access | 10332 | 0.53 | 1 | 1000 | 0 | 9843 | 0 |
| interaction | 7743 | 0.65 | 1 | 1000 | 0 | 11412 | 0 |
| house_rules | 9438 | 0.57 | 1 | 1000 | 0 | 10357 | 0 |
| picture_url | 0 | 1.00 | 35 | 146 | 0 | 21216 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 13139 | 0 |
| host_name | 0 | 1.00 | 1 | 35 | 0 | 4149 | 0 |
| host_location | 75 | 1.00 | 1 | 104 | 0 | 625 | 0 |
| host_about | 8718 | 0.60 | 1 | 5443 | 0 | 7251 | 8 |
| host_response_time | 0 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 0 | 1.00 | 2 | 4 | 0 | 52 | 0 |
| host_acceptance_rate | 0 | 1.00 | 2 | 4 | 0 | 81 | 0 |
| host_thumbnail_url | 0 | 1.00 | 55 | 106 | 0 | 13103 | 0 |
| host_picture_url | 0 | 1.00 | 57 | 109 | 0 | 13103 | 0 |
| host_neighbourhood | 9299 | 0.57 | 3 | 39 | 0 | 171 | 0 |
| host_verifications | 0 | 1.00 | 2 | 161 | 0 | 304 | 0 |
| street | 0 | 1.00 | 10 | 166 | 0 | 528 | 0 |
| neighbourhood | 4894 | 0.78 | 4 | 33 | 0 | 54 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 7 | 22 | 0 | 16 | 0 |
| city | 30 | 1.00 | 2 | 146 | 0 | 270 | 0 |
| state | 150 | 0.99 | 2 | 38 | 0 | 120 | 0 |
| zipcode | 1172 | 0.95 | 4 | 31 | 0 | 865 | 0 |
| market | 25 | 1.00 | 8 | 21 | 0 | 5 | 0 |
| smart_location | 0 | 1.00 | 6 | 154 | 0 | 291 | 0 |
| country_code | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| country | 0 | 1.00 | 6 | 6 | 0 | 1 | 0 |
| property_type | 0 | 1.00 | 3 | 23 | 0 | 37 | 0 |
| bed_type | 4 | 1.00 | 5 | 13 | 0 | 5 | 0 |
| amenities | 0 | 1.00 | 2 | 1714 | 0 | 20546 | 0 |
| weekly_price | 20852 | 0.04 | 6 | 11 | 0 | 573 | 0 |
| monthly_price | 20873 | 0.04 | 7 | 13 | 0 | 603 | 0 |
| security_deposit | 9696 | 0.56 | 5 | 11 | 0 | 671 | 0 |
| calendar_updated | 0 | 1.00 | 5 | 14 | 0 | 88 | 0 |
| jurisdiction_names | 331 | 0.98 | 18 | 110 | 0 | 17 | 0 |
| prop_type_simplified | 0 | 1.00 | 4 | 11 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2020-06-20 | 2020-06-26 | 2020-06-21 | 7 |
| host_since | 0 | 1.00 | 2009-02-03 | 2020-06-16 | 2016-11-01 | 2947 |
| calendar_last_scraped | 0 | 1.00 | 2020-06-20 | 2020-06-26 | 2020-06-21 | 7 |
| first_review | 5371 | 0.75 | 2011-07-28 | 2020-06-22 | 2018-12-04 | 2113 |
| last_review | 5371 | 0.75 | 2013-12-21 | 2020-06-25 | 2020-03-03 | 1287 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| room_type | 0 | 1 | FALSE | 4 | Ent: 10985, Pri: 10153, Sha: 388, Hot: 298 |
| cancellation_policy | 1 | 1 | FALSE | 5 | fle: 11310, mod: 6015, str: 4467, sup: 28 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| thumbnail_url | 21824 | 0 | NaN | : |
| medium_url | 21824 | 0 | NaN | : |
| xl_picture_url | 21824 | 0 | NaN | : |
| host_is_superhost | 0 | 1 | 0.34 | FAL: 14359, TRU: 7465 |
| host_has_profile_pic | 0 | 1 | 1.00 | TRU: 21784, FAL: 40 |
| host_identity_verified | 0 | 1 | 0.30 | FAL: 15266, TRU: 6558 |
| neighbourhood_group_cleansed | 21824 | 0 | NaN | : |
| is_location_exact | 0 | 1 | 0.83 | TRU: 18037, FAL: 3787 |
| has_availability | 0 | 1 | 1.00 | TRU: 21824 |
| requires_license | 0 | 1 | 0.00 | FAL: 21824 |
| license | 21824 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.57 | TRU: 12471, FAL: 9353 |
| is_business_travel_ready | 0 | 1 | 0.00 | FAL: 21824 |
| require_guest_profile_picture | 0 | 1 | 0.01 | FAL: 21665, TRU: 159 |
| require_guest_phone_verification | 0 | 1 | 0.01 | FAL: 21666, TRU: 158 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.84e+07 | 1.13e+07 | 3.58e+04 | 2.00e+07 | 3.02e+07 | 3.86e+07 | 4.39e+07 | ▂▃▅▅▇ |
| scrape_id | 0 | 1.00 | 2.02e+13 | 0.00e+00 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 1.22e+08 | 9.68e+07 | 7.36e+03 | 3.75e+07 | 1.02e+08 | 1.94e+08 | 3.50e+08 | ▇▅▃▃▂ |
| host_listings_count | 0 | 1.00 | 3.04e+01 | 2.82e+02 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+00 | 3.33e+03 | ▇▁▁▁▁ |
| host_total_listings_count | 0 | 1.00 | 3.04e+01 | 2.82e+02 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+00 | 3.33e+03 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 1.94e+01 | 5.00e-02 | 1.92e+01 | 1.94e+01 | 1.94e+01 | 1.94e+01 | 1.96e+01 | ▁▁▅▇▁ |
| longitude | 0 | 1.00 | -9.92e+01 | 4.00e-02 | -9.93e+01 | -9.92e+01 | -9.92e+01 | -9.92e+01 | -9.90e+01 | ▁▁▇▁▁ |
| accommodates | 0 | 1.00 | 3.05e+00 | 2.21e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 4.00e+00 | 5.00e+01 | ▇▁▁▁▁ |
| bathrooms | 26 | 1.00 | 1.40e+00 | 1.00e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.50e+00 | 5.00e+01 | ▇▁▁▁▁ |
| bedrooms | 48 | 1.00 | 1.42e+00 | 1.14e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+01 | ▇▁▁▁▁ |
| beds | 274 | 0.99 | 1.83e+00 | 1.75e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+01 | ▇▁▁▁▁ |
| square_feet | 21758 | 0.00 | 1.07e+03 | 3.99e+03 | 0.00e+00 | 0.00e+00 | 2.37e+02 | 8.40e+02 | 3.23e+04 | ▇▁▁▁▁ |
| price | 0 | 1.00 | 1.48e+03 | 4.99e+03 | 0.00e+00 | 4.08e+02 | 7.25e+02 | 1.32e+03 | 3.50e+05 | ▇▁▁▁▁ |
| cleaning_fee | 0 | 1.00 | 2.38e+02 | 4.58e+02 | 0.00e+00 | 0.00e+00 | 1.12e+02 | 3.50e+02 | 2.40e+04 | ▇▁▁▁▁ |
| guests_included | 0 | 1.00 | 1.62e+00 | 1.34e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.60e+01 | ▇▁▁▁▁ |
| extra_people | 0 | 1.00 | 1.23e+02 | 2.69e+02 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 2.00e+02 | 6.84e+03 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 4.09e+00 | 2.24e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 6.86e+02 | 7.08e+02 | 1.00e+00 | 4.50e+01 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| minimum_minimum_nights | 0 | 1.00 | 3.89e+00 | 1.96e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 0 | 1.00 | 4.20e+00 | 2.26e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 0 | 1.00 | 8.47e+02 | 6.70e+02 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| maximum_maximum_nights | 0 | 1.00 | 8.49e+02 | 6.69e+02 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 0 | 1.00 | 4.04e+00 | 2.06e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 0 | 1.00 | 8.48e+02 | 6.69e+02 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 1.95e+01 | 1.24e+01 | 0.00e+00 | 1.00e+00 | 2.60e+01 | 3.00e+01 | 3.00e+01 | ▃▁▁▂▇ |
| availability_60 | 0 | 1.00 | 4.21e+01 | 2.38e+01 | 0.00e+00 | 2.20e+01 | 5.60e+01 | 6.00e+01 | 6.00e+01 | ▂▁▁▁▇ |
| availability_90 | 0 | 1.00 | 6.54e+01 | 3.47e+01 | 0.00e+00 | 4.90e+01 | 8.50e+01 | 9.00e+01 | 9.00e+01 | ▂▁▁▁▇ |
| availability_365 | 0 | 1.00 | 2.20e+02 | 1.39e+02 | 0.00e+00 | 8.90e+01 | 2.11e+02 | 3.61e+02 | 3.65e+02 | ▃▂▃▁▇ |
| number_of_reviews | 0 | 1.00 | 2.36e+01 | 4.18e+01 | 0.00e+00 | 1.00e+00 | 6.00e+00 | 2.80e+01 | 5.55e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 9.18e+00 | 1.50e+01 | 0.00e+00 | 0.00e+00 | 2.00e+00 | 1.20e+01 | 1.77e+02 | ▇▁▁▁▁ |
| review_scores_rating | 5616 | 0.74 | 9.51e+01 | 8.64e+00 | 2.00e+01 | 9.40e+01 | 9.70e+01 | 1.00e+02 | 1.00e+02 | ▁▁▁▁▇ |
| review_scores_accuracy | 5632 | 0.74 | 9.72e+00 | 8.50e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_cleanliness | 5632 | 0.74 | 9.59e+00 | 9.20e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_checkin | 5635 | 0.74 | 9.82e+00 | 7.20e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_communication | 5631 | 0.74 | 9.78e+00 | 7.90e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_location | 5636 | 0.74 | 9.81e+00 | 6.80e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_value | 5637 | 0.74 | 9.58e+00 | 9.00e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 6.09e+00 | 1.60e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 4.00e+00 | 1.57e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 4.09e+00 | 1.55e+01 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 1.57e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 1.77e+00 | 4.20e+00 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 4.80e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 9.00e-02 | 8.30e-01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 2.00e+01 | ▇▁▁▁▁ |
| reviews_per_month | 5371 | 0.75 | 1.36e+00 | 1.50e+00 | 1.00e-02 | 3.00e-01 | 8.20e-01 | 1.92e+00 | 1.47e+01 | ▇▁▁▁▁ |
#Output in kable format
favstats(listings_wip$price) %>%
kbl(caption = "Listing Price", col.names = c("Minimum", "Q1","Median", "Q3", "Maximum", "Mean", "SD", "Count", "Missing")) %>%
kable_material(c("striped", "hover")) %>%
kable_styling(fixed_thead = T)
| Minimum | Q1 | Median | Q3 | Maximum | Mean | SD | Count | Missing | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 408 | 725 | 1316 | 349990 | 1484 | 4987 | 21824 | 0 |
favstats(listings_wip$cleaning_fee) %>%
kbl(caption = "Cleaning Fee", col.names = c("Minimum", "Q1","Median", "Q3", "Maximum", "Mean", "SD", "Count", "Missing")) %>%
kable_material(c("striped", "hover")) %>%
kable_styling(fixed_thead = T)
| Minimum | Q1 | Median | Q3 | Maximum | Mean | SD | Count | Missing | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 112 | 350 | 24000 | 238 | 458 | 21824 | 0 |
favstats(listings_wip$review_scores_rating) %>%
kbl(caption = "Scores Rating", col.names = c("Minimum", "Q1","Median", "Q3", "Maximum", "Mean", "SD", "Count", "Missing")) %>%
kable_material(c("striped", "hover")) %>%
kable_styling(fixed_thead = T)
| Minimum | Q1 | Median | Q3 | Maximum | Mean | SD | Count | Missing | |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 94 | 97 | 100 | 100 | 95.1 | 8.64 | 16208 | 5616 |
Creating informative visualizations
By creating infromative visuals we will be able to better understand the relationships between different variables in our dataframe and help get a better picture.
Quantitative variables
Let’s look how the selected quantitative variables are distributed.
Observations
We may see that the distribution of review ratings has a high concentration on scores between 95 and 98 and is strongly skewed to the left, which might indicate that people are more likely to give positive feedback. While for other variables, we may see after adjusting the x-axis to log10 scale, cleaning fees, extra people and price show approximately normal distribution, although there are some outliers in prices which are extremely large due to the longer period of rent. The number of reviews is more uniformly distributed.
# visualizing the quantitative variables
quant_plot1 <- listings_wip %>%
select(var_of_interest_quant) %>%
pivot_longer(cols=2:6, names_to="Category", values_to="Value")
# create ggplot with facet wrap to have each variable next to each other
ggplot(quant_plot1, aes(x=Value)) +
geom_histogram() +
facet_wrap(~Category) +
scale_x_log10() +
labs(title = "Review scores stand out - strongly skewed to the left",
subtitle = "Distribution of selected quantitative variables",
x = "Log10",
y = "Count") +
theme_bw()

Let’s look at the relationship between a few selected variables.
1. Relationship between price and review score rating
As per the below Scatter plot it is evident that the price of the stay and the review ratings have a slightly positive correlation. However, we must note that most of the listings have a rating score of 70 or higher.
# select relevant variables
quant_plot2 <- listings_wip %>%
select(id, price, bedrooms, number_of_reviews, minimum_nights, review_scores_rating)
# plot price - rating relationship
ggplot(quant_plot2, aes(x=review_scores_rating, y=price)) +
geom_point() +
scale_y_log10() +
labs(title = "Majority of listing have rating above 70 - afterwards slightly positive correlation",
subtitle = "Relationship between price and review score rating",
x = "Review score rating",
y = "Price (log10)") +
theme_bw()

2. Relationship between price and number of reviews
The plot below demonstrates that there are relatively fewer number of ratings for expensive listings, this could be due to fewer people being able and/or willing to spend the amount required to stay at those properties.
# plot price - # of reviews relationship
ggplot(quant_plot2, aes(x=number_of_reviews, y=price)) +
geom_point() +
scale_y_log10() +
labs(title = "Intuitive relationship - only few can afford/ review very expensive listings",
subtitle = "Relationship between price and number of reviews",
x = "Number of reviews",
y = "Price (log10)") +
theme_bw()

3. Relationship between the number of reviews and the rating
Upon analyzing the relationship between number of reviews and rating score it is evident that as more people review for a property, it is more likely to be highly rated.
# plot number of reviews - review rating score
ggplot(quant_plot2, aes(x=number_of_reviews, y=review_scores_rating)) +
geom_point() +
scale_y_log10() +
scale_x_log10() +
labs(title = "Increasing number of reviews reduces variance in ratings - the more the better",
subtitle = "Relationship between the number of reviews and the rating",
x = "Number of reviews (log10)",
y = "Rating (log10)") +
theme_bw()

Qualitative variables
Let’s have a first look again on the property type and prices per night.
Observation
From the graph, we may see among all properties in Mexico, apartments are the most common type with more than 12000 properties collected and lofts are the least common with only around 1200. This might be because an apartment has better room layout with convenient facilities and relatively acceptable price and thus it is more popular.
We may also see that loft has the highest median price per night among all types, which might also be the reasons for low quantity demanded. While apartment has a medium reasonable price, it is surprising to notice that median price per night for house is lower than that for apartment. One of the reasons might be houses are normally located at rural area and apartments are more central.
# plot the different property types
ggplot(listings_wip, aes(x=fct_infreq(prop_type_simplified))) +
geom_bar() +
labs(title = "Appartment by far the most common property type on Airbnb in Mexico",
subtitle = "Property types in descending order",
x = "Property types",
y = "Count") +
theme_bw()

# look at price levels for each type
property_price <- listings_wip %>%
group_by(prop_type_simplified) %>%
summarise(avg_price = median(price))
# plot median prices for each property type
ggplot(property_price, aes(x=reorder(prop_type_simplified, -avg_price), y=avg_price)) +
geom_bar(stat="identity") +
labs(title = "House cheaper than appartment - probably due to less central location",
subtitle = "Median price by property type in descending order",
x = "Property types",
y = "Median price MXN per night") +
theme_bw()

Let’s then analyse the room types and prices per night.
Observation
When comparing hotel rooms with entire apartments, we might see the number of entire apartments are nearly 30 times higher than the number of hotel rooms. This might be because that hotels are more likely to be posted on other hotel websites instead of Airbnb.
And also, hotel rooms have the highest median price per night (more than $1300) among all types of properties, with shared rooms being the cheapest. This is intuitively correct as shared rooms tend to be less convenient and luxurious than hotel rooms.
listings_wip %>%
count(room_type, sort = TRUE)
## # A tibble: 4 x 2
## room_type n
## <fct> <int>
## 1 Entire home/apt 10985
## 2 Private room 10153
## 3 Shared room 388
## 4 Hotel room 298
# plot the different property types
ggplot(listings_wip, aes(x=fct_infreq(room_type))) +
geom_bar() +
labs(title = "Entire apartment by far the most common room type",
subtitle = "Room types in descending order",
x = "Room types",
y = "Count") +
theme_bw()

# look at price levels for each type
room_price <- listings_wip %>%
group_by(room_type) %>%
summarise(avg_price_room = median(price))
# plot median prices for each property type
ggplot(room_price, aes(x=reorder(room_type, -avg_price_room), y=avg_price_room)) +
geom_bar(stat="identity") +
labs(title = "Hotel rooms are the most expensive room type - shared rooms the cheapest",
subtitle = "Median price by property type in descending order",
x = "Property types",
y = "Median price MXN per night") +
theme_bw()

Next, let’s find out in which neighborhoods the listings are primarily located.
Observation
We may see Polanco and Roma Norte have the most listings with more than 1500 properties. This is because the two neighborhoods are mostly where tourists tend to cluster, and are both affluent areas. As there are higher demands for accommodation from tourists, the provision of listings is also higher than other areas. San Pedro De Los Pinos has the least listing because this neighborhood is located in the west of Mexico City and relatively rural and inconvenient in terms of transportation for tourists.
It is also noticeable that there are many listings without specified area. Since only high level perspective and many unlabeled values, we are going to have a closer look at the locations in the next step - the mapping.
# filter out 20 most common neighborhoods
top_neighbourhoods <- listings_wip %>%
count(neighbourhood, sort = TRUE) %>%
top_n(20)
# visualize it with ggplot
ggplot(top_neighbourhoods, aes(x=reorder(neighbourhood, n), y=n))+
geom_bar(stat="identity") +
coord_flip() +
labs(title = "Despite many unlabeled neighbourhoods, Polanco and Roma Norte most common",
subtitle = "Neighbourhoods with most listings",
x = "Neighbourhood",
y = "# of listings") +
theme_bw()

Mapping
Let’s have deeper look at the listings’ locations by plotting them on a map of mexico city. The color of the dots indicate the price level.
bubbles_color <- colorNumeric(palette = c("darkgreen", "yellow", "red"), listings_wip$price)
leaflet(data = filter(listings_wip)) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~bubbles_color(price),
fillOpacity = 0.4,
popup = ~listing_url,
label = ~paste(room_type,"|",
"Price: ", price)) %>%
addLegend("bottomleft", pal = bubbles_color,
values = ~price,
title = "Price of listing",
labFormat = labelFormat(prefix = "MXN "),
opacity = 5)
It becomes obvious that the prices seem way to high and undifferentiated. This is because there are a few very high outlier prices. Let’s remove outliers and look at the map again.
# look at statistics to see where to make the cut
print(fav_stats(listings_wip$price))
## min Q1 median Q3 max mean sd n missing
## 0 408 725 1316 349990 1484 4987 21824 0
# check length before making the cut
length(listings_wip$price)
## [1] 21824
# make the cut
normal_prices <- listings_wip %>%
filter(price > 100) %>%
filter(price < 2500)
# check length after making the cut
length(normal_prices$price)
## [1] 19895
Thereby, we cut out extreme outliers while keeping >90% of the values. Let’s look at the new map.
bubbles_color <- colorNumeric(palette = c("darkgreen", "yellow", "red"), normal_prices$price)
leaflet(data = filter(normal_prices)) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~bubbles_color(price),
fillOpacity = 0.4,
popup = ~listing_url,
label = ~paste(room_type,"|", "Price: ", price)) %>%
addLegend("bottomleft", pal = bubbles_color,
values = ~price,
title = "Price of listing",
labFormat = labelFormat(prefix = "MXN "),
opacity = 5)
Regression
Create and analyse target variable Y (price_4_night)
To create the new variable, we continue to use the normalized prices (calculated earlier) which exclude outliers.
# Calculate the price for 4 nights for 2 people using the normalized prices to exclude outliers
normal_prices <- normal_prices %>%
mutate(
price_4_nights = case_when(guests_included >= 2 ~ (price*4+cleaning_fee),
TRUE ~ ((price+extra_people)*4+cleaning_fee)),
log_price_4_nights = log(price_4_nights)
)
Let’s compare both distributions: without transformation vs. log transformation.
Observation
Since the non-transformed data is clearly skewed to the right, it makes sense to continue with the log data which is normally distributed. It would be better to use normal distribution here because we can then analyse correlations and make predictions more accurately.
# create a histogram to observe the distribution - without log10
ggplot(normal_prices,aes(x=price_4_nights)) +
geom_histogram() +
theme_bw() +
scale_y_log10() +
labs(
title="The distribution of price_4_nights is skewed to the right",
subtitle="Distribution of not transformed data",
x="Price_4_nights",
y="Count") +
theme_bw()

# create a histogram to observe the distribution - log10
ggplot(normal_prices,aes(x=log_price_4_nights)) +
geom_histogram() +
theme_bw() +
labs(
title="The log distribution of price_4_nights is normally distributed",
subtitle="Distribution of log transformation",
x="price_4_nights (log)",
y="Count") +
theme_bw()

Model 1
# create model 1
model1 <- lm(log_price_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating, data=normal_prices)
msummary(model1)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.690839 0.055059 139.68 < 2e-16 ***
## prop_type_simplifiedCondominium 0.064090 0.019102 3.36 8e-04 ***
## prop_type_simplifiedHouse -0.373289 0.014437 -25.86 < 2e-16 ***
## prop_type_simplifiedLoft 0.128907 0.020792 6.20 5.8e-10 ***
## prop_type_simplifiedOther -0.178072 0.017844 -9.98 < 2e-16 ***
## number_of_reviews 0.001583 0.000108 14.68 < 2e-16 ***
## review_scores_rating 0.003781 0.000575 6.57 5.1e-11 ***
##
## Residual standard error: 0.602 on 15041 degrees of freedom
## (4847 observations deleted due to missingness)
## Multiple R-squared: 0.0764, Adjusted R-squared: 0.076
## F-statistic: 207 on 6 and 15041 DF, p-value: <2e-16
autoplot(model1)

car::vif(model1)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.03 4 1.00
## number_of_reviews 1.03 1 1.01
## review_scores_rating 1.01 1 1.00
Diagnostics of model 1:
- Residuals vs. fitted: Linearly assumption is fulfilled since the residuals are randomly distributed.
- Normal Q-Q: Normality assumption is decently fulfilled, however, small deviations towards higher quantities
- Scale-Location: Equal variance assumption seems to be fulfilled.
- Residuals vs. factor levels: Occasionally, there are a few points with a high leverage impact the estimation.
- Variance Inflation Factor (VIF): Not an issue since all well below 5.
Overall: The overall model is significant (see F-statistic) and all explanatory variables are significant at least on a p = 0.01 level. However, the model only explains around 5% of the variance, which is very low.
Let’s interpret the coefficient of review_scores_rating and prop_type_simplified. Note: Since property types are factors with k level, apartment is left out and serves as baseline.
# anti-log the coefficients
condominium <- (exp(0.051008)-1)*100
house <- (exp(-0.371592)-1)*100
loft <- (exp(0.123123)-1)*100
other <- (exp(-0.165880)-1)*100
reviews <- (exp(0.001482)-1)*100
rating <- (exp(0.003389)-1)*100
cat("Condominium =", condominium)
## Condominium = 5.23
cat("\nHouse =", house)
##
## House = -31
cat("\nLoft =", loft)
##
## Loft = 13.1
cat("\nOther =", other)
##
## Other = -15.3
cat("\nReviews =", reviews)
##
## Reviews = 0.148
cat("\nRating =", rating)
##
## Rating = 0.339
Interpretation
The model predicts a 4 nights stay in a condominium to be 5.23% more expensive than in an apartment.
The model predicts a 4 nights stay in a house to be 31% less expensive than in an apartment.
The model predicts a 4 nights stay in a condominium to be 13.1% more expensive than in an apartment.
The model predicts a 4 nights stay in a other property types to be 15.3% less expensive than in an apartment.
The model predicts a 4 nights stay to become 0.14% more expensive with each additional review.
The model predicts a 4 nights stay to become 0.39% more expensive with each additional rating point
Model 2: Room_type a good predictor?
Let’s see if the room type of an airbnb enhances our model, being a good predictor.
# create model 2 (just adding room_type as a explanatory variable)
model2 <- lm(log_price_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type, data=normal_prices)
msummary(model2)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.10e+00 5.61e-02 126.64 < 2e-16 ***
## prop_type_simplifiedCondominium 4.97e-02 1.56e-02 3.19 0.0014 **
## prop_type_simplifiedHouse -1.23e-02 1.25e-02 -0.98 0.3266
## prop_type_simplifiedLoft -6.74e-02 1.71e-02 -3.94 8.2e-05 ***
## prop_type_simplifiedOther -4.65e-02 1.57e-02 -2.97 0.0030 **
## number_of_reviews 2.43e-04 8.93e-05 2.73 0.0064 **
## review_scores_rating 3.10e-03 4.69e-04 6.61 4.0e-11 ***
## room_typePrivate room 2.35e-01 3.55e-02 6.63 3.5e-11 ***
## room_typeEntire home/apt 1.00e+00 3.56e-02 28.09 < 2e-16 ***
## room_typeHotel room 8.54e-01 5.22e-02 16.35 < 2e-16 ***
##
## Residual standard error: 0.491 on 15038 degrees of freedom
## (4847 observations deleted due to missingness)
## Multiple R-squared: 0.386, Adjusted R-squared: 0.386
## F-statistic: 1.05e+03 on 9 and 15038 DF, p-value: <2e-16
autoplot(model2)

car::vif(model2)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.38 4 1.04
## number_of_reviews 1.06 1 1.03
## review_scores_rating 1.01 1 1.00
## room_type 1.41 3 1.06
Diagnostics of model 2:
- Residuals vs. fitted: Linearly assumption is fulfilled since the residuals are randomly distributed with equal magnitude and concentration numbers above and below.
- Normal Q-Q: Normality assumption is fulfilled for the values in the middle, however, small deviations towards quantities on two extreme sides.
- Scale-Location: Equal variance assumption seems to be fulfilled since standarised residuals are randomly distributed with equal magnitude and concentration numbers above and below.
- Residuals vs. factor levels: Equally distributed above and below, no extreme leverages impact the estimation.
- Variance Inflation Factor (VIF): Not an issue since all well below 5.
Overall: The overall model is significant (see F-statistic) and most explanatory variables are significant at least on a p = 0.01 level except the number of reviewing and it shows that there is no significant price difference for 4 nights between apartments and houses.This also explains that an additional reviewing will not increase the price significantly. The model explains 38.5% of the variance, which is higher than the original model. Therefore, this adjusted model is stronger when analysing the correlation between prices and other factors and makes a more accurate estimation.
Additional models: Further questions to explore
Are the number of bathrooms, bedrooms, beds, or size of the house (accomodates) significant predictors of price_4_nights?
ANSWER: Since bedrooms, bathrooms, and beds are highly correlated, we only select bedrooms as a new predictor. Looking at the diagnostics, we realize however that there are some outliers that add avoidable errors. Therefore, we create the model again with a cleaned bedrooms variable in the next code of chunk.
# adding bedrooms variable (without removing outliers)
model3 <- lm(log_price_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bedrooms, data=normal_prices)
msummary(model3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.99e+00 5.57e-02 125.61 < 2e-16 ***
## prop_type_simplifiedCondominium 4.73e-02 1.54e-02 3.08 0.0021 **
## prop_type_simplifiedHouse -2.49e-02 1.24e-02 -2.02 0.0435 *
## prop_type_simplifiedLoft -1.29e-02 1.71e-02 -0.75 0.4518
## prop_type_simplifiedOther -4.12e-02 1.55e-02 -2.66 0.0078 **
## number_of_reviews 2.84e-04 8.82e-05 3.22 0.0013 **
## review_scores_rating 3.39e-03 4.64e-04 7.32 2.6e-13 ***
## room_typePrivate room 2.29e-01 3.50e-02 6.55 5.9e-11 ***
## room_typeEntire home/apt 9.45e-01 3.53e-02 26.79 < 2e-16 ***
## room_typeHotel room 8.13e-01 5.16e-02 15.76 < 2e-16 ***
## bedrooms 8.14e-02 4.14e-03 19.67 < 2e-16 ***
##
## Residual standard error: 0.485 on 15026 degrees of freedom
## (4858 observations deleted due to missingness)
## Multiple R-squared: 0.402, Adjusted R-squared: 0.401
## F-statistic: 1.01e+03 on 10 and 15026 DF, p-value: <2e-16
autoplot(model3)

car::vif(model3)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.42 4 1.05
## number_of_reviews 1.06 1 1.03
## review_scores_rating 1.01 1 1.00
## room_type 1.53 3 1.07
## bedrooms 1.10 1 1.05
After removing outliers
Comparing the diagnostics of model 3 and 4, we can clearly see the positive effect of removing the outliers. Additionally, it increases R-squared by 1% to 41%.
Moreover, to answer the question, yes the number of bedrooms is a significant predictor.
# removing outlier
normal_prices <- normal_prices %>%
filter(bedrooms < 6)
# adding bedrooms variable (with removing outliers)
model4 <- lm(log_price_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bedrooms, data=normal_prices)
msummary(model4)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.91e+00 5.56e-02 124.18 < 2e-16 ***
## prop_type_simplifiedCondominium 4.35e-02 1.53e-02 2.85 0.00440 **
## prop_type_simplifiedHouse -2.95e-02 1.23e-02 -2.40 0.01645 *
## prop_type_simplifiedLoft 4.01e-02 1.73e-02 2.31 0.02077 *
## prop_type_simplifiedOther -3.36e-02 1.54e-02 -2.18 0.02951 *
## number_of_reviews 3.01e-04 8.75e-05 3.44 0.00059 ***
## review_scores_rating 3.39e-03 4.62e-04 7.34 2.2e-13 ***
## room_typePrivate room 2.30e-01 3.47e-02 6.63 3.4e-11 ***
## room_typeEntire home/apt 8.95e-01 3.51e-02 25.49 < 2e-16 ***
## room_typeHotel room 8.18e-01 5.13e-02 15.96 < 2e-16 ***
## bedrooms 1.62e-01 6.73e-03 24.13 < 2e-16 ***
##
## Residual standard error: 0.48 on 14981 degrees of freedom
## (4786 observations deleted due to missingness)
## Multiple R-squared: 0.411, Adjusted R-squared: 0.411
## F-statistic: 1.05e+03 on 10 and 14981 DF, p-value: <2e-16
autoplot(model4)

car::vif(model4)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.49 4 1.05
## number_of_reviews 1.06 1 1.03
## review_scores_rating 1.01 1 1.00
## room_type 1.74 3 1.10
## bedrooms 1.30 1 1.14
After controlling for other variables, is a listing’s exact location a significant predictor of price_4_nights
ANSWER: Yes, the p-value indicates that the exact location is a significant predictor. However, it does not increase the explanation power (R-squared) a lot.
# adding bedrooms variable (with removing outliers)
model5 <- lm(log_price_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bedrooms + is_location_exact, data=normal_prices)
msummary(model5)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.85e+00 5.63e-02 121.59 < 2e-16 ***
## prop_type_simplifiedCondominium 4.35e-02 1.52e-02 2.85 0.00437 **
## prop_type_simplifiedHouse -2.67e-02 1.23e-02 -2.17 0.02981 *
## prop_type_simplifiedLoft 4.00e-02 1.73e-02 2.31 0.02083 *
## prop_type_simplifiedOther -3.12e-02 1.54e-02 -2.02 0.04326 *
## number_of_reviews 3.01e-04 8.73e-05 3.44 0.00058 ***
## review_scores_rating 3.35e-03 4.61e-04 7.27 3.8e-13 ***
## room_typePrivate room 2.35e-01 3.47e-02 6.79 1.2e-11 ***
## room_typeEntire home/apt 8.97e-01 3.50e-02 25.60 < 2e-16 ***
## room_typeHotel room 8.22e-01 5.12e-02 16.06 < 2e-16 ***
## bedrooms 1.63e-01 6.72e-03 24.25 < 2e-16 ***
## is_location_exactTRUE 7.30e-02 1.07e-02 6.80 1.1e-11 ***
##
## Residual standard error: 0.48 on 14980 degrees of freedom
## (4786 observations deleted due to missingness)
## Multiple R-squared: 0.413, Adjusted R-squared: 0.412
## F-statistic: 958 on 11 and 14980 DF, p-value: <2e-16
autoplot(model5)

car::vif(model5)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.49 4 1.05
## number_of_reviews 1.06 1 1.03
## review_scores_rating 1.01 1 1.00
## room_type 1.74 3 1.10
## bedrooms 1.30 1 1.14
## is_location_exact 1.00 1 1.00
What is the effect of cancellation_policy on price_4_nights, after we control for other variables?
ANSWER: Depending on the level of cancellation policy, it provides a significant predictor. The fact that a listing has a moderate cancellation policy makes it 12% more expensive compared to a listing with a flexible cancellation policy. Looking at the scale location of this model raises a few questions regarding the standardized residuals.
# create model considering cancellation policy
model6 <- lm(log_price_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bedrooms + is_location_exact + cancellation_policy, data=normal_prices)
msummary(model6)
## Estimate Std. Error t value
## (Intercept) 6.82e+00 5.61e-02 121.49
## prop_type_simplifiedCondominium 4.42e-02 1.52e-02 2.91
## prop_type_simplifiedHouse -2.68e-02 1.22e-02 -2.19
## prop_type_simplifiedLoft 4.52e-02 1.72e-02 2.62
## prop_type_simplifiedOther -2.36e-02 1.54e-02 -1.54
## number_of_reviews 1.55e-04 8.79e-05 1.77
## review_scores_rating 3.24e-03 4.60e-04 7.05
## room_typePrivate room 2.39e-01 3.45e-02 6.93
## room_typeEntire home/apt 8.88e-01 3.49e-02 25.42
## room_typeHotel room 8.18e-01 5.11e-02 16.01
## bedrooms 1.62e-01 6.69e-03 24.15
## is_location_exactTRUE 7.12e-02 1.07e-02 6.66
## cancellation_policymoderate 6.68e-02 9.14e-03 7.31
## cancellation_policystrict_14_with_grace_period 1.15e-01 1.04e-02 11.06
## cancellation_policysuper_strict_60 3.06e-01 4.79e-01 0.64
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## prop_type_simplifiedCondominium 0.0036 **
## prop_type_simplifiedHouse 0.0283 *
## prop_type_simplifiedLoft 0.0088 **
## prop_type_simplifiedOther 0.1247
## number_of_reviews 0.0772 .
## review_scores_rating 1.8e-12 ***
## room_typePrivate room 4.3e-12 ***
## room_typeEntire home/apt < 2e-16 ***
## room_typeHotel room < 2e-16 ***
## bedrooms < 2e-16 ***
## is_location_exactTRUE 2.8e-11 ***
## cancellation_policymoderate 2.9e-13 ***
## cancellation_policystrict_14_with_grace_period < 2e-16 ***
## cancellation_policysuper_strict_60 0.5234
##
## Residual standard error: 0.477 on 14977 degrees of freedom
## (4786 observations deleted due to missingness)
## Multiple R-squared: 0.418, Adjusted R-squared: 0.417
## F-statistic: 768 on 14 and 14977 DF, p-value: <2e-16
autoplot(model6)

car::vif(model6)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.50 4 1.05
## number_of_reviews 1.09 1 1.04
## review_scores_rating 1.01 1 1.01
## room_type 1.77 3 1.10
## bedrooms 1.30 1 1.14
## is_location_exact 1.01 1 1.00
## cancellation_policy 1.07 3 1.01
# anit-log
(exp(1.13e-01)-1)*100
## [1] 12
Let’s compare all the models to see which one to take for the final prediction
Comparing the 6 models we have created, we can see that we continuously improved the explanation power (r-squared. Therefore, we are going to continue in our final prediction with model6.
# use huxtable to compare models
huxtable::huxreg(model1,model2, model3, model4, model5, model6)
| (1) | (2) | (3) | (4) | (5) | (6) | |
|---|---|---|---|---|---|---|
| (Intercept) | 7.691 *** | 7.102 *** | 6.992 *** | 6.908 *** | 6.846 *** | 6.819 *** |
| (0.055) | (0.056) | (0.056) | (0.056) | (0.056) | (0.056) | |
| prop_type_simplifiedCondominium | 0.064 *** | 0.050 ** | 0.047 ** | 0.043 ** | 0.043 ** | 0.044 ** |
| (0.019) | (0.016) | (0.015) | (0.015) | (0.015) | (0.015) | |
| prop_type_simplifiedHouse | -0.373 *** | -0.012 | -0.025 * | -0.030 * | -0.027 * | -0.027 * |
| (0.014) | (0.012) | (0.012) | (0.012) | (0.012) | (0.012) | |
| prop_type_simplifiedLoft | 0.129 *** | -0.067 *** | -0.013 | 0.040 * | 0.040 * | 0.045 ** |
| (0.021) | (0.017) | (0.017) | (0.017) | (0.017) | (0.017) | |
| prop_type_simplifiedOther | -0.178 *** | -0.047 ** | -0.041 ** | -0.034 * | -0.031 * | -0.024 |
| (0.018) | (0.016) | (0.015) | (0.015) | (0.015) | (0.015) | |
| number_of_reviews | 0.002 *** | 0.000 ** | 0.000 ** | 0.000 *** | 0.000 *** | 0.000 |
| (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | |
| review_scores_rating | 0.004 *** | 0.003 *** | 0.003 *** | 0.003 *** | 0.003 *** | 0.003 *** |
| (0.001) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | |
| room_typePrivate room | 0.235 *** | 0.229 *** | 0.230 *** | 0.235 *** | 0.239 *** | |
| (0.035) | (0.035) | (0.035) | (0.035) | (0.035) | ||
| room_typeEntire home/apt | 1.000 *** | 0.945 *** | 0.895 *** | 0.897 *** | 0.888 *** | |
| (0.036) | (0.035) | (0.035) | (0.035) | (0.035) | ||
| room_typeHotel room | 0.854 *** | 0.813 *** | 0.818 *** | 0.822 *** | 0.818 *** | |
| (0.052) | (0.052) | (0.051) | (0.051) | (0.051) | ||
| bedrooms | 0.081 *** | 0.162 *** | 0.163 *** | 0.162 *** | ||
| (0.004) | (0.007) | (0.007) | (0.007) | |||
| is_location_exactTRUE | 0.073 *** | 0.071 *** | ||||
| (0.011) | (0.011) | |||||
| cancellation_policymoderate | 0.067 *** | |||||
| (0.009) | ||||||
| cancellation_policystrict_14_with_grace_period | 0.115 *** | |||||
| (0.010) | ||||||
| cancellation_policysuper_strict_60 | 0.306 | |||||
| (0.479) | ||||||
| N | 15048 | 15048 | 15037 | 14992 | 14992 | 14992 |
| R2 | 0.076 | 0.386 | 0.402 | 0.411 | 0.413 | 0.418 |
| logLik | -13712.358 | -10639.219 | -10436.481 | -10271.543 | -10248.410 | -10182.210 |
| AIC | 27440.716 | 21300.438 | 20896.963 | 20567.085 | 20522.821 | 20396.419 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||||||
Best model for final prediciton
Given information:
Find Airbnb’s that are apartment with a private room, have at least 10 reviews, and an average rating of at least 90. Use your best model to predict the total cost to stay at this Airbnb for 4 nights. Include the appropriate 95% interval with your prediction. Report the point prediction and interval in terms of price_4_nights.
ANSWER:
For a private room in Mexico city with a minimum rating of 90 and at least 10 reviews, the model predicts a 95% price CI of MXN 2057 to MXN 2168. This corresponds to EUR 82 to EUR 87. Making a sanity check with the actual listings right now, it becomes clear that this price is a quite decent approximation.
# select data based on given restrictions
final_prediction_data <- normal_prices %>%
filter(room_type == "Private room",
review_scores_rating >= 90,
number_of_reviews >= 10)
# Make final price predictions and give the corresponding 95% CI
prediction_log_price_4_nights <- predict(model6, newdata = final_prediction_data, interval = "confidence") %>%
exp() %>%
data.frame() %>%
summarize(lower_bound = mean(lwr),
predicted_price = mean(fit),
upper_bound = mean(upr))
prediction_log_price_4_nights %>%
kbl(col.names = c("Lower Bound (MXN)", "Prediction (MXN)", "Upper Bound (MXN)")) %>%
kable_classic("striped") %>%
kable_styling(fixed_thead = T)
| Lower Bound (MXN) | Prediction (MXN) | Upper Bound (MXN) |
|---|---|---|
| 2048 | 2100 | 2153 |
Details
- Who did you collaborate with: Muhammad Nauman Alam Khan, Tom Invernizzi, Rayna Zhang, Jerome Billiet, Christopher Baumann
- Approximately how much time did you spend on this problem set: Approx. 15h
- What, if anything, gave you the most trouble: –